feat(backend): Q5_K packed matmul + ARM NEON + Kotlin/Native cinterop path by michalharakal · Pull Request #734 · SKaiNET-developers/SKaiNET

michalharakal · 2026-06-11T12:09:15Z

Adds a first-class Q5_K packed in-kernel dequant-matmul to the CPU backend (it was previously only eagerly decoded to FP32), hand-written ARM NEON kernels, and a Kotlin/Native cinterop consumption path so the kernels run on the board binary (not just the JVM via FFM).

What's here

- Q5_K (256-elt / 176-byte super-block): TensorEncoding.Q5_K, Q5_KTensorData/Q5_KBlockTensorData (5th-bit fold from qh), Q5KMatmulKernel SPI, scalar (commonMain) + Panama (JVM) + native-C kernels, DefaultCpuOps dispatch + lazy transpose, and a StreamingGgufParametersLoader Q5_K/Q6_K packed branch.
- ARM NEON (behind #if __ARM_NEON; x86 keeps the scalar fallback): fp32, q8_0, q4k, q5k. CMake aarch64 branch -march=armv8.2-a+fp16+dotprod (no +i8mm, because the A55 lacks it). Cross toolchain + opt-in -PcrossArm64.
- Kotlin/Native cinterop: CMake now also builds a static archive libskainet_kernels.a; linuxX64 + linuxArm64 targets with a shared nativeMain; NativeKn*MatmulKernel + NativeKnKernelProvider (+ installNativeKernels()) so K/N resolves the C kernels through KernelRegistry.
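
For readers unfamiliar with the gating pattern, here is a minimal sketch of an fp32 kernel that takes a NEON path behind a preprocessor check and otherwise falls back to scalar code, in the spirit of the "broadcast + vfmaq" approach named in the commit message. The names `SK_HAVE_NEON`, `axpy_f32` and `matvec_f32` are hypothetical and are not taken from the PR; the real kernels may be structured differently.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__ARM_NEON) && defined(__aarch64__)
#include <arm_neon.h>
#define SK_HAVE_NEON 1 /* hypothetical macro name, for illustration only */
#else
#define SK_HAVE_NEON 0
#endif

/* out[i] += x * w[i] for i in [0, n). Hypothetical helper, not the PR's API. */
static void axpy_f32(float *out, const float *w, float x, size_t n) {
    assert(out != NULL && w != NULL);
    size_t i = 0;
#if SK_HAVE_NEON
    const float32x4_t xv = vdupq_n_f32(x);          /* broadcast the input scalar */
    for (; i + 4 <= n; i += 4) {
        float32x4_t acc = vld1q_f32(out + i);
        acc = vfmaq_f32(acc, vld1q_f32(w + i), xv); /* acc += w * x, fused */
        vst1q_f32(out + i, acc);
    }
#endif
    /* Scalar path: the whole loop on x86, only the tail on aarch64. */
    for (; i < n; ++i) out[i] += x * w[i];
}

/* out[j] = sum_r x[r] * w[r][j], with w stored row-major as k rows of n floats. */
static void matvec_f32(float *out, const float *x, const float *w, size_t k, size_t n) {
    for (size_t j = 0; j < n; ++j) out[j] = 0.0f;
    for (size_t r = 0; r < k; ++r) axpy_f32(out, w + r * n, x[r], n);
}
```

Because the scalar loop doubles as the tail handler, both builds execute the same code for any elements left over after the 4-wide loop, which keeps the x86 fallback and the aarch64 path easy to compare in a parity test.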

Verification

- Q5_K bit-exact vs the DequantOps golden across blocks; native↔Panama↔scalar matmul parity; capability-matrix gate updated.
- Kotlin/Native: cinterop kernel ↔ scalar parity + registry resolution green on linuxX64 (6 tests); compileKotlinLinuxArm64 + cinterop cross-compile from x86.
- JVM/FFM path unchanged.
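
As an illustration of what a matmul parity check of this kind involves, the sketch below compares a reference matmul against a second implementation element by element. It is not the project's test code: `matmul_ref`, `matmul_candidate`, `parity_ok` and the relative-tolerance rule are all assumptions made for the example, and the actual tests may use a different tolerance or an exact comparison.

```c
#include <assert.h>
#include <stddef.h>

/* Reference matmul: out[i][j] = sum_p a[i][p] * b[p][j], all row-major. */
static void matmul_ref(float *out, const float *a, const float *b,
                       size_t m, size_t k, size_t n) {
    for (size_t i = 0; i < m; ++i)
        for (size_t j = 0; j < n; ++j) {
            float acc = 0.0f;
            for (size_t p = 0; p < k; ++p) acc += a[i * k + p] * b[p * n + j];
            out[i * n + j] = acc;
        }
}

/* Stand-in for an optimized kernel: same math, different loop nesting. */
static void matmul_candidate(float *out, const float *a, const float *b,
                             size_t m, size_t k, size_t n) {
    for (size_t i = 0; i < m * n; ++i) out[i] = 0.0f;
    for (size_t i = 0; i < m; ++i)
        for (size_t p = 0; p < k; ++p)
            for (size_t j = 0; j < n; ++j)
                out[i * n + j] += a[i * k + p] * b[p * n + j];
}

/* Returns 1 when every element agrees within rel_tol * max(1, |ref|). */
static int parity_ok(const float *ref, const float *got, size_t len, float rel_tol) {
    assert(rel_tol >= 0.0f);
    for (size_t i = 0; i < len; ++i) {
        float diff = ref[i] - got[i];
        if (diff < 0.0f) diff = -diff;
        float mag = ref[i] < 0.0f ? -ref[i] : ref[i];
        if (mag < 1.0f) mag = 1.0f;
        if (!(diff <= rel_tol * mag)) return 0; /* the negated test also rejects NaN */
    }
    return 1;
}
```

A tolerance-based comparison is the usual choice when a vectorized kernel may reorder or fuse floating-point operations; a bit-exact comparison, as used for the dequantization golden test, only makes sense when both sides perform the same operations in the same order.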

Board-verify-pending

The NEON paths are aarch64-syntax-validated (clang --target=aarch64) but not executed (x86 host, no QEMU). The final aarch64 binary link + NEON runtime parity need the SL2610 (or QEMU).

🤖 Generated with Claude Code

Adds Q5_K as a packed in-kernel dequant-matmul format (previously Q5_K was only eagerly decoded to FP32 on load), mirroring the existing Q4_K plumbing, and hand-written ARM NEON paths for the native CPU kernels.

Q5_K (256-elt / 176-byte super-block: d, dMin, 12 packed scales, 32-byte qh high-bit plane, 128-byte qs low nibbles; 5-bit code = lowNibble | (5th<<4)):
- TensorEncoding.Q5_K; Q5_KTensorData / Q5_KBlockTensorData (5th-bit fold).
- Q5KMatmulKernel SPI + matmulQ5K()/"Q5_K" in KernelProvider.supports().
- ScalarQ5_KMatmulKernel (commonMain/KN), PanamaVectorQ5_KMatmulKernel (JVM), native C skainet_q5k_matmul + NativeQ5KMatmulKernel (FFM); all registered.
- DefaultCpuOps matmul dispatch + lazy-transpose branches.
- StreamingGgufParametersLoader: Q5_K + Q6_K packed branches (a Q5_K_M GGUF now loads end-to-end instead of SKIP'ing most tensors).

Tests: Q5_KBlockTensorData bit-exact vs DequantOps golden across blocks; native<->Panama<->scalar matmul parity; KernelSupportMatrixTest gate updated.

ARM NEON (behind #if __ARM_NEON in skainet_simd.h; x86 keeps the scalar fallback, re-verified green):
- fp32 (broadcast+vfmaq), q8_0 (widen int8->f32+vfmaq), q4k/q5k (nibble unpack + dual code/input accumulators; q5k folds the qh 5th bit via a runtime-count vshlq_u8).
- CMake aarch64 branch: -march=armv8.2-a+fp16+dotprod (no +i8mm, because the A55 lacks it). Cross toolchain-aarch64.cmake + opt-in -PcrossArm64 gradle tasks; default x86 build unaffected.

BOARD-VERIFY-PENDING: the NEON paths are aarch64-syntax-validated (clang --target=aarch64) but not executed (x86 host, no QEMU). Run the parity tests under qemu-aarch64 or on the SL2610 before relying on them.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
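
To make the super-block layout and the "dual code/input accumulators" in this commit message concrete, here is a scalar sketch. The field order and sizes come from the commit message; everything else is an assumption: the struct and function names are hypothetical, d and dMin are assumed to be stored as raw FP16 bits, the bit ordering inside qh and qs is assumed to follow the ggml/GGUF Q5_K convention, and the affine dequantization w = scale * code - min is assumed rather than quoted from the PR. The per-sub-block scale and min are passed in as already-decoded floats so that the example stays short.

```c
#include <assert.h>
#include <stdint.h>

#define Q5K_BLOCK_ELEMS 256

/* Hypothetical C view of one Q5_K super-block (2 + 2 + 12 + 32 + 128 bytes). */
typedef struct {
    uint16_t d;          /* super-block scale, assumed raw FP16 bits */
    uint16_t dmin;       /* super-block min scale, assumed raw FP16 bits */
    uint8_t  scales[12]; /* packed sub-block scales and mins */
    uint8_t  qh[32];     /* high-bit plane: one 5th bit per element */
    uint8_t  qs[128];    /* low nibbles, two elements per byte */
} q5k_block;

_Static_assert(sizeof(q5k_block) == 176, "Q5_K super-block must be 176 bytes");

/* 5-bit code of element idx (0..255): lowNibble | (5th << 4).
 * The bit positions below assume the ggml ordering. */
static uint8_t q5k_code(const q5k_block *b, int idx) {
    assert(idx >= 0 && idx < Q5K_BLOCK_ELEMS);
    const int group = idx / 64;        /* 4 groups of 64 elements, 32 qs bytes each */
    const int half  = (idx % 64) / 32; /* 0 = low nibble, 1 = high nibble */
    const int lane  = idx % 32;
    const uint8_t byte   = b->qs[group * 32 + lane];
    const uint8_t nibble = half ? (uint8_t)(byte >> 4) : (uint8_t)(byte & 0x0F);
    const uint8_t fifth  = (uint8_t)((b->qh[lane] >> (2 * group + half)) & 1u);
    return (uint8_t)(nibble | (fifth << 4));
}

/* Dot product of one 32-element sub-block with the dequant folded in.
 * With w_i = scale * code_i - min:
 *   sum(w_i * x_i) = scale * sum(code_i * x_i) - min * sum(x_i)
 * so two accumulators suffice and no FP32 weights are materialized. */
static float q5k_subblock_dot(const q5k_block *b, int sub,
                              float scale, float min, const float *x) {
    assert(sub >= 0 && sub < 8);
    float code_acc = 0.0f;  /* sum of code_i * x_i */
    float input_acc = 0.0f; /* sum of x_i */
    for (int i = 0; i < 32; ++i) {
        code_acc  += (float)q5k_code(b, sub * 32 + i) * x[i];
        input_acc += x[i];
    }
    return scale * code_acc - min * input_acc;
}
```

The scale and min are applied once per sub-block rather than once per element, which is presumably why the NEON q4k/q5k paths keep a code accumulator and an input accumulator side by side.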

The hand-written matmul kernels were JVM-only (consumed via FFM), but the SL2610 board binary is Kotlin/Native, so it can't use the FFM wrapper. Add a K/N consumption path via cinterop so the board gets the same C (and, on aarch64, NEON) kernels.
- CMake builds a STATIC archive (skainet_kernels_static -> libskainet_kernels.a) alongside the SHARED lib; same sources + flags (incl. the aarch64 NEON march).
- cinterop .def (skainet_kernels.h -> sk.ainet.kernels.cinterop).
- linuxX64 target on the (previously jvm-only) module, linking the static archive into K/N binaries; link tasks depend on the CMake build.
- NativeKnQ5KMatmulKernel (linuxX64Main): calls skainet_q5k_matmul via cinterop with pinned arrays (zero-copy).

POC verified on the host (linuxX64) with NativeKnQ5KMatmulKernelParityTest: the cinterop kernel matches the commonMain ScalarQ5_KMatmulKernel across 4 shapes (tests=4, failures=0). JVM/FFM path unchanged (jvmTest green). linuxArm64 board target + NEON runtime check are the remaining step.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

The K/N analogue of the JVM NativeKernelProvider (FFM): a KernelProvider (priority 100) exposing the cinterop-backed Q5_K/Q4_K/Q8_0/Q4_0 matmul kernels, plus installNativeKernels() to register it in KernelRegistry, the path the eager runtime's DefaultCpuOps.chooseQuantizedMatmulHeap uses to resolve a kernel. K/N has no ServiceLoader, so registration is an explicit call by the consumer (scalar fallback for Q6_K etc. is registered separately from skainet-backend-cpu).

Verified on linuxX64 with NativeKnKernelProviderTest: installNativeKernels makes native-cinterop the best-available provider, its Q5_K kernel is the registry-resolved kernel, and it matches the scalar reference (6 K/N tests green total).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
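
The resolution behaviour described in this commit (explicit registration, highest-priority provider wins, scalar fallback for encodings the native provider does not cover) can be sketched in a few lines. This is a language-neutral illustration written in C, not the Kotlin API: every type and function name below is hypothetical, and the scalar provider's priority of 0 and its supported set are assumptions. Only the priority of 100, the provider name native-cinterop and its four encodings come from the commit message.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical provider record: a name, a priority, and a capability query. */
typedef struct {
    const char *name;
    int priority;
    int (*supports)(const char *encoding);
} kernel_provider;

#define MAX_PROVIDERS 8
static const kernel_provider *g_providers[MAX_PROVIDERS];
static size_t g_provider_count = 0;

/* Explicit registration: nothing is discovered automatically. */
static void registry_register(const kernel_provider *p) {
    assert(p != NULL && g_provider_count < MAX_PROVIDERS);
    g_providers[g_provider_count++] = p;
}

/* Highest-priority provider that supports the encoding, or NULL if none does. */
static const kernel_provider *registry_resolve(const char *encoding) {
    const kernel_provider *best = NULL;
    for (size_t i = 0; i < g_provider_count; ++i) {
        const kernel_provider *p = g_providers[i];
        if (p->supports(encoding) && (best == NULL || p->priority > best->priority))
            best = p;
    }
    return best;
}

static int one_of(const char *e, const char *const *set, size_t n) {
    for (size_t i = 0; i < n; ++i)
        if (strcmp(e, set[i]) == 0) return 1;
    return 0;
}

static int native_supports(const char *e) {
    static const char *const set[] = {"Q5_K", "Q4_K", "Q8_0", "Q4_0"};
    return one_of(e, set, 4);
}

static int scalar_supports(const char *e) { /* assumed set, for illustration */
    static const char *const set[] = {"Q5_K", "Q4_K", "Q8_0", "Q4_0", "Q6_K"};
    return one_of(e, set, 5);
}

static const kernel_provider native_cinterop = {"native-cinterop", 100, native_supports};
static const kernel_provider scalar_fallback = {"scalar", 0, scalar_supports};

/* Stand-in for the explicit install call made by the consumer. */
static void install_native_kernels(void) { registry_register(&native_cinterop); }
```

In this model, a consumer that forgets the install call still gets a working result from the scalar provider, just not the fast one, which is the trade-off of explicit registration on a platform without ServiceLoader.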

Promote the K/N cinterop path from the linuxX64 POC to the real board target:
- linuxArm64 target with the same skainet_kernels cinterop; links the aarch64 cross-built static archive (cmake-build-arm64/libskainet_kernels.a, NEON).
- Shared `nativeMain` source set holds NativeKn*MatmulKernel + the provider, so linuxX64 and linuxArm64 share one implementation (cinterop bindings are commonized across both targets).
- linuxArm64 link tasks depend on the aarch64 cross-build only under -PcrossArm64 (toolchain present); a plain host build still compiles linuxArm64 to a klib.

Verified on host: compileKotlinLinuxArm64 + cinteropSkainetKernelsLinuxArm64 succeed (cross-compiled from x86); linuxX64Test still green (6 tests) on the shared nativeMain. Final aarch64 binary link + NEON runtime are board-verify-pending.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

github-actions · 2026-06-11T12:11:48Z

📖 Documentation Preview

The documentation has been built successfully for this PR.

Generated Files:

Operator documentation: docs/modules/operators/_generated_/
JSON schema output: operators.json

Artifacts:

Download the documentation-preview-734 artifact to view the complete documentation locally.

This comment will be updated automatically when the PR is updated.

michalharakal and others added 4 commits June 10, 2026 23:13

michalharakal mentioned this pull request Jun 11, 2026

feat(gemma): eager Q5_K packed path + Kotlin/Native board load path SKaiNET-developers/SKaiNET-transformers#176

Merged

michalharakal merged commit 92485f2 into develop Jun 11, 2026
14 checks passed

michalharakal deleted the feature/q5k-neon-kernels branch June 11, 2026 12:51

michalharakal mentioned this pull request Jun 13, 2026

release: 0.30.0 #735

Merged

